Patrick Cao
Insider trading, or trading financial assets using information that is not open to the public, is punishable by law in the United States. It is a serious violation -- a maximum penalty of 20 years in prison, and up to $20 million in fines. The Securities and Exchange Commission (SEC) is responsible for identifying and conducting investigations regarding insider trading.
With that being said, U.S. senators (and other congresspeople) often have access to information about upcoming policies or laws before the general public, yet are still allowed to trade securities. Senators, however, must disclose their trades publicly. More recently, a few senators have been investigated by the SEC for insider trading using nonpublic knowledge given to them about the COVID-19 pandemic. With the rise in popularity of investing, senator stock trading patterns has become to the public's attention. You may have seen posts on the internet implying that there is rampant insider trading going on in the senate. For example, take this post on Reddit that was upvoted over 30,000 times: https://www.reddit.com/r/dataisbeautiful/comments/gjlvnd/.
Is insider trading in the senate as rampant as this post suggests? This tutorial aims to identify potential insider trading patterns by the U.S. senate. In order to answer this question, we will be using data from https://senatestockwatcher.com/, which itself compiles data from https://efdsearch.senate.gov/. efdsearch.senate.gov does not have an official API, and scraping each page would be extremely tedious. Senate stock watcher aggregates financial disclosures from all senators into one nice dataset.
Let's first start by downloading the dataset. While Senate Stock Watcher provides an API for up to date datasets, we can just download a csv file as well. Download the "All Transactions" csv file from https://senatestockwatcher.com/api.html.
Let's load that into a pandas dataframe.
import pandas as pd
import numpy as np
import datetime
df = pd.read_csv('all_transactions.csv')
df.head()
Let's clean this up a little by limiting and purchases and sales only, and dropping unnecessary columns.
start_date = '2016-01-01'
end_date = '2020-11-01'
# Turn transaction_date from string into datetime
df['transaction_date'] = pd.to_datetime(df['transaction_date'])
# Remove (Full) and (Partial) after Sale type
df['type'] = df['type'].apply(lambda x: str(x).split()[0])
# Limit to only purchases and sales
df = df[(df['type'] == 'Purchase') | (df['type'] == 'Sale')]
# limit the dataset to a date range
df = df[(df['transaction_date'] > start_date) & (df['transaction_date'] < end_date)]
df = df.drop(['asset_description', 'comment', 'ptr_link'], axis=1)
df.head()
We got our senator stock trading data, but now we need actual historical market data to visualize these trades. Let's use yfinance, a python Yahoo! Finance library, to get historical market data on SPY, an ETF tracking the S&P500.
import yfinance as yf
spy = yf.Ticker('SPY')
spy_hist = spy.history(period='max')
spy_hist.tail()
Neat! Now we can start putting these two things together.
import matplotlib.pyplot as plt
# number of trades by senators
num_trades = df.groupby('senator').count().sort_values(by='transaction_date', ascending=False).head(10)
plt.title('Top 10 Most Active Senators')
plt.xlabel('Total Number of Trades')
plt.ylabel('Senator')
plt.barh(num_trades.index, num_trades['transaction_date'])
plt.show()
# number of trades by ticker
trades_by_ticker = df[df['ticker'] != '--'].groupby('ticker').count().sort_values(by='transaction_date', ascending=False).head(20)
plt.title('Top 20 Most Traded Stocks')
plt.xlabel('Total Number of Trades')
plt.ylabel('Ticker')
plt.barh(trades_by_ticker.index, trades_by_ticker['transaction_date'])
plt.show()
# values of trades
trade_val = df.groupby('amount').count().sort_values(by='transaction_date', ascending=False)
plt.title('Value of Trades')
plt.xlabel('Number of Trades')
plt.ylabel('Value of Trade ($)')
plt.barh(trade_val.index, trade_val['transaction_date'])
plt.show()
It looks like most trades by senators are small -- under $15,000 per trade -- and they like trading AAPL, which is coincidentally the largest market cap company in the world. There's a good mix of tech, pharma, communications, and retail stocks that are the most commonly traded in the senate. There's not too much insight we can pull from this, but it is interesting.
Let's now try to recreate the reddit post. We need to plot overall senator trading activity against the S&P500, or SPY.
It's a little tough to track performance exactly. The size of each trade isn't an exact number -- it's a range. Let's check out what kinds of values we're working with here.
df['amount'].unique()
Yikes. The data doesn't have the exact value of securities traded, the best thing we can probably do is estimate the size of each trade by just taking the middle value of each trade. Unfortunately, the granularity of each bucket is pretty big, but we'll have to try and make do.
# Returns the average of the given bound
def getBounds(row):
amount = row[0]
trade_type = row[1]
# This amount shows up once in the entire dataset. Let's just assume the trade is $50m for now. It won't affect our analysis too much.
if amount == 'Over $50,000,000':
return 5e7
split = amount.split(' - ')
lower = int(split[0][1:].replace(',', ''))
upper = int(split[1][1:].replace(',', ''))
mid = lower + ((upper - lower) / 2)
if trade_type == 'Purchase':
return mid
return -mid
df['est_amt'] = pd.Series(df[['amount', 'type']].apply(getBounds, axis=1))
df
Alright, now we got some rough estimates of the value of each transaction, we can group trades together by date and find the aggregate for each day.
aggregate_action_with_type = df.groupby([pd.Grouper(key='transaction_date', freq='D'), 'asset_type', 'type']).agg(est_amt=('est_amt', 'sum'), num_trades=('est_amt', 'count')).reset_index()
# because a sale is the opposite of a purchase, we want to show that
aggregate_action_by_asset = aggregate_action_with_type.copy()
aggregate_action_by_asset['num_purchases'] = aggregate_action_by_asset.apply(lambda row: -row['num_trades'] if row['type'] == 'Sale' else row['num_trades'], axis=1)
aggregate_action_by_asset = aggregate_action_by_asset.groupby(['transaction_date', 'asset_type']).sum().reset_index()
aggregate_action = aggregate_action_by_asset.groupby('transaction_date').sum().reset_index()
print(aggregate_action_by_asset.head())
print(aggregate_action.head())
We are now ready to plot senator trades against the market.
spy_hist['Date'] = spy_hist.index
# since we've limited the senator trades to a date range, we want to also limit SPY history
# to the same date range
market = spy_hist[(spy_hist['Date'] > start_date) & (spy_hist['Date'] < end_date)]
def plot_trades_vs_spy(trades, asset_type, metric, y_label):
# We have a lot of data, so we should make the figure very wide
plt.figure(figsize=(18,6))
plt.plot(market['Date'], market['Close'])
plt.xlabel('Year')
plt.ylabel('SPY price ($)')
ax2 = plt.twinx()
# scale bar graph limits s.t. 0 is centered in graph
limit = max(abs(trades[metric].min()), trades[metric].max()) * 1.1
ax2.set_ylim(-limit, limit)
ax2.spines['bottom'].set_position(('data', 0))
plt.ylabel('Amount of Securities Traded ($)')
plt.bar(trades['transaction_date'], trades[metric], width=5, color='orange')
plt.title('Senator Trades of ' + str(asset_type) + ' (' + metric + ')' + ' vs. SPY price')
plt.show()
for asset_type in aggregate_action_by_asset['asset_type'].unique():
trades_by_asset = aggregate_action_by_asset[aggregate_action_by_asset['asset_type'] == asset_type]
plot_trades_vs_spy(trades_by_asset, asset_type, 'num_purchases', 'Number of Purchases/Week')
plot_trades_vs_spy(trades_by_asset, asset_type, 'est_amt', 'Net Securities Traded/Week ($)')
plot_trades_vs_spy(aggregate_action, 'Overall', 'num_purchases', 'Number of Purchases/Week')
plot_trades_vs_spy(aggregate_action, 'Overall', 'est_amt', 'Net Securities Traded/Week ($)')
Is there a relationship between these trades and future SPY price? It would be helpful to try and quantify this and instead plot this data on a scatterplot. Let's plot these trades vs. future SPY price as well. To find the price of SPY a number of days in the future, we need to know all the open market days. Luckily, there is a library for that.
But how long in the future should we check? It's probably not feasible to check a lot of different number of days in the future, so we have to pick this value somewhat arbitrarily. Let's just pick 1 week and 3 weeks. There are 5 trading days per week, so when looking at future trading days, we could look at SPY prices 5 and 15 days in the future.
import pandas_market_calendars as mcal
from datetime import datetime
nyse = mcal.get_calendar('NYSE')
market_open_days = nyse.valid_days(start_date=start_date, end_date=datetime.today().strftime('%Y-%m-%d'))
# returns a function that gets the spy price num_days from today
def get_future_spy_price(num_days):
def h(today):
# calculate current spy data
today_index = market_open_days.get_loc(today, method='nearest')
today_close = spy_hist.iloc[spy_hist.index.get_loc(today, method='nearest')]['Close']
# calculate future date and get spy data
future_index = today_index + num_days
future_date = market_open_days[future_index]
future_close = spy_hist.iloc[spy_hist.index.get_loc(future_date, method='nearest')]['Close']
return 100 * (future_close - today_close) / today_close
return h
aggregate_action_by_asset['5_day_spy_pct_change'] = aggregate_action_by_asset['transaction_date'].apply(get_future_spy_price(5))
aggregate_action_by_asset['15_day_spy_pct_change'] = aggregate_action_by_asset['transaction_date'].apply(get_future_spy_price(15))
aggregate_action['5_day_spy_pct_change'] = aggregate_action['transaction_date'].apply(get_future_spy_price(5))
aggregate_action['15_day_spy_pct_change'] = aggregate_action['transaction_date'].apply(get_future_spy_price(15))
aggregate_action
Now that we have our data, we can begin plotting. We want to plot value of trades vs. future value and number of purchases vs. future value for both 5 and 15 days in the future.
def set_subplot(ax, x, y, title, x_label, y_label):
ax.scatter(x, y)
ax.set_title(title)
ax.set_xlabel(x_label)
ax.set_ylabel(y_label)
ax.grid()
def plot_trades_vs_future_spy_price(trades, asset_type):
fig, axs = plt.subplots(1, 4, figsize=(24, 5))
fig.suptitle('Trades vs. Future SPY Price % Change for ' + asset_type)
set_subplot(axs[0], trades['est_amt'], trades['5_day_spy_pct_change'], '5 day', 'Net Value of Assets Bought/Sold', '5 Day % Change in S&P500')
set_subplot(axs[1], trades['num_purchases'], trades['5_day_spy_pct_change'], '5 day', 'Number of Purchases (Sales count as negative)', '5 Day % Change in S&P500')
set_subplot(axs[2], trades['est_amt'], trades['15_day_spy_pct_change'], '15 day', 'Net Value of Assets Bought/Sold', '15 Day % Change in S&P500')
set_subplot(axs[3], trades['num_purchases'], trades['15_day_spy_pct_change'], '15 day', 'Number of Purchases (Sales count as negative)', '15 Day % Change in S&P500')
plt.show()
for asset_type in aggregate_action_by_asset['asset_type'].unique():
trades_by_asset = aggregate_action_by_asset[aggregate_action_by_asset['asset_type'] == asset_type]
plot_trades_vs_future_spy_price(trades_by_asset, asset_type)
plot_trades_vs_future_spy_price(aggregate_action, 'Overall')
There aren't any obvious trends in most of the categories. However, we can still try to uncover trends by building linear regression models and performing hypothesis tests with this data. More on that in the machine learning and hypothesis testing section.
aggregate_action_by_senator = df.groupby([pd.Grouper(key='transaction_date', freq='D'), 'asset_type', 'type', 'senator']).sum()
# Groupby makes transaction_date and type an index, we want to convert those back into a column
aggregate_action_by_senator.reset_index(inplace=True)
# limit the dataset to a date range
aggregate_action_by_senator = aggregate_action_by_senator[(aggregate_action_by_senator['transaction_date'] > start_date) & (aggregate_action_by_senator['transaction_date'] < end_date)]
aggregate_action_by_senator['est_amt'] = aggregate_action_by_senator.apply(lambda row: row['est_amt'] if row['type'] == 'Purchase' else -row['est_amt'], axis=1)
aggregate_action_by_senator['5_day_spy_pct_change'] = aggregate_action_by_senator['transaction_date'].apply(get_future_spy_price(5))
# aggregate_total['15_day_spy_pct_change'] = aggregate_total['transaction_date'].apply(get_future_spy_price(15))
# plt.figure(figsize=(30,10))
# plt.scatter(senator_actions['mid_amt'], testx['5_day_spy_pct_change'])
# plt.xlabel('Estimated Amount of Securities Purchased/Sold')
# plt.show()
# plt.figure(figsize=(30,10))
# plt.scatter(test['mid_amt'], test['15_day_spy_pct_change'])
# plt.show()
print(aggregate_action_by_senator)
for senator in aggregate_action_by_senator['senator'].unique():
senator_actions = aggregate_action_by_senator[aggregate_action_by_senator['senator'] == senator]
plt.axvline(0, color='black')
plt.axhline(0, color='black')
plt.grid()
plt.scatter(senator_actions['est_amt'], senator_actions['5_day_spy_pct_change'])
plt.show()
hists = {}
# returns a function that gets the spy price num_days from today
def get_future_stock_price(ticker, today, num_days):
# print('Finding %change in price after', num_days, 'days for', ticker, 'on', today)
try:
stock_history = yf.Ticker(ticker).history(period='5y')
# calculate current stock data
today_index = market_open_days.get_loc(today, method='nearest')
today_close = stock_history.iloc[stock_history.index.get_loc(today, method='nearest')]['Close']
# calculate future date and get stock data
future_index = today_index + num_days
future_date = market_open_days[future_index]
future_close = stock_history.iloc[stock_history.index.get_loc(future_date, method='nearest')]['Close']
return (future_close - today_close) / today_close
except Exception as e:
return np.nan
stock_trades = df[df['ticker'] != '--']
# we're processing an insane amount of data here, so let's save this dataframe back to the disk
# so we can read it back easily
try:
stock_trades = pd.read_csv('stock_trading_performance.csv')
except FileNotFoundError:
print('calculating...')
stock_trades['5_day_pct_change'] = stock_trades.apply(lambda row: get_future_stock_price(row['ticker'], row['transaction_date'], 5), axis=1)
print('done with 5 day')
stock_trades['15_day_pct_change'] = stock_trades.apply(lambda row: get_future_stock_price(row['ticker'], row['transaction_date'], 15), axis=1)
print('done with 15 day')
stock_trades['30_day_pct_change'] = stock_trades.apply(lambda row: get_future_stock_price(row['ticker'], row['transaction_date'], 30), axis=1)
print('done with 15 day')
# stock_trades.apply(lambda row: get_future_stock_price(row['ticker'], row['transaction_date'], 5), axis=1)
stock_trades.to_csv('stock_trading_performance.csv')
print(stock_trades['owner'].unique())
print(len(stock_trades['senator'].unique()))
stock_trades['5_day_pct_change'] = pd.to_numeric(stock_trades['5_day_pct_change'], errors='coerce')
stock_trades['15_day_pct_change'] = pd.to_numeric(stock_trades['15_day_pct_change'], errors='coerce')
for senator in stock_trades['senator'].unique():
for owner in stock_trades['owner'].unique():
print(senator, owner)
stock_trades_by_senator = stock_trades[(stock_trades['senator'] == senator) & (stock_trades['owner'] == owner)]
plt.scatter(stock_trades_by_senator['est_amt'], stock_trades_by_senator['5_day_pct_change'])
plt.show()
# plt.scatter(stock_trades['mid_amt'], stock_trades['5_day_pct_change'])
# plt.show()
# plt.scatter(stock_trades['mid_amt'], stock_trades['15_day_pct_change'])
for senator in stock_trades['senator'].unique():
stock_trades_by_senator = stock_trades[stock_trades['senator'] == senator]
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
fig.suptitle('Trade Value vs. Future Stock Price % Change for Senator ' + senator)
axs[0].scatter(stock_trades_by_senator['est_amt'], stock_trades_by_senator['5_day_pct_change'])
axs[0].set_title('5 Day')
axs[0].set_xlabel('Net Value of Assets Bought/Sold')
axs[0].set_ylabel('5 Day % Change in Stock Price')
axs[1].scatter(stock_trades_by_senator['est_amt'], stock_trades_by_senator['15_day_pct_change'])
axs[1].set_title('15 Day')
axs[1].set_xlabel('Net Value of Assets Bought/Sold')
axs[1].set_ylabel('15 Day % Change in Stock Price')
axs[0].grid()
axs[1].grid()
# plt.
# plt.title('Trade vs. 5 Day Future Stock Price % Change for Senator ' + senator)
plt.show()